Context

Although only about 15 % of all waste within the EU is generated as municipal waste1, the absolute figures pose a major problem for municipalities, waste management companies and the environment. In 2020, 225.7 million tonnes of municipal waste were collected in the EU, of which only 68 million tonnes were recycled directly; the remainder went into long-term landfill or was incinerated for energy generation. Given the climate-damaging landfill gases produced during storage, the CO2 emissions during incineration and the large amount of space required, the EU aims to continuously optimise its waste management. This is intended to promote the production of less waste, a stronger circular economy and the economic efficiency of waste management. As part of this optimisation, we want to work out a status quo of municipal waste management in Italian municipalities, on which subsequent optimisation projects can build. For this purpose, we base our work on a data set covering the waste management of 4341 Italian municipalities. With the help of these data, we draw up profiles of the municipalities, with which we can cluster them with regard to their descriptive characteristics, in particular the key figures of waste management, but also geographical and economic factors.

Exploratory Data Analysis

Get an overview of the data set.

wm_df <- load2("data/waste_management.RData")
skimr::skim(wm_df)
Data summary
Name wm_df
Number of rows 4341
Number of columns 36
_______________________
Column type frequency:
character 5
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 5 21 0 20 0
Provinz 0 1 4 21 0 102 0
Gemeinde 6 1 2 60 0 4333 0
Gebuehrenregelung 0 1 4 8 0 2 0
Region_PAYT 0 1 2 4 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 47469.93 30089.80 1272.00 18135.00 42015.00 70049.00 111107.00 ▇▃▅▃▃
Flaeche 6 1.00 41.00 56.78 0.12 10.85 22.73 47.49 1287.39 ▇▁▁▁▁
Bevoelkerung 0 1.00 10203.84 53426.40 34.00 1579.00 3535.00 8199.00 2617175.00 ▇▁▁▁▁
Bevoelkerungsdichte 6 1.00 405.05 771.21 2.48 62.59 151.32 399.36 12122.83 ▇▁▁▁▁
Strassen 443 0.90 101.93 309.99 1.00 25.00 51.00 105.00 14970.00 ▇▁▁▁▁
Inselgemeinde 6 1.00 0.01 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
Kuestengemeinde 6 1.00 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Urbanisierungsgrad 6 1.00 2.49 0.59 1.00 2.00 3.00 3.00 3.00 ▁▁▆▁▇
Geologischer_Indikator 285 0.93 2.29 0.89 1.00 1.00 3.00 3.00 3.00 ▃▁▂▁▇
Abfaelle_gesamt 0 1.00 5.31 32.54 0.02 0.61 1.52 3.95 1691.89 ▇▁▁▁▁
Abfaelle_sortiert 0 1.00 3.25 15.62 0.00 0.37 1.04 2.73 765.13 ▇▁▁▁▁
Abfaelle_unsortiert 0 1.00 2.04 17.64 0.01 0.18 0.41 1.06 926.76 ▇▁▁▁▁
Sortierungsgrad 0 1.00 60.11 19.80 0.00 44.26 64.34 76.46 97.48 ▁▃▅▇▅
Sort_Bio 511 0.88 22.28 12.75 0.01 11.13 24.98 31.84 61.64 ▅▅▇▂▁
Sort_Papier 25 0.99 10.96 3.88 0.00 8.66 10.88 13.06 45.29 ▃▇▁▁▁
Sort_Glas 33 0.99 9.41 3.71 0.00 7.15 9.10 11.28 39.84 ▅▇▁▁▁
Sort_Holz 1095 0.75 4.11 2.72 0.00 2.08 4.02 5.71 25.12 ▇▃▁▁▁
Sort_Metall 246 0.94 1.76 1.35 0.00 0.88 1.54 2.35 20.67 ▇▁▁▁▁
Sort_Plastik 39 0.99 6.11 3.26 0.00 4.13 5.79 7.55 31.60 ▇▆▁▁▁
Sort_Elektrik 314 0.93 1.23 0.82 0.00 0.78 1.18 1.57 17.95 ▇▁▁▁▁
Sort_Textil 1013 0.77 0.76 0.69 0.00 0.35 0.63 0.99 10.58 ▇▁▁▁▁
Sort_Rest 136 0.97 7.94 5.15 0.03 3.96 7.13 11.13 37.16 ▇▆▂▁▁
Verwendung_Energie 0 1.00 20.28 15.68 0.00 5.63 18.54 38.50 55.12 ▇▃▂▇▁
Verwendung_Deponie 0 1.00 17.95 19.46 0.00 4.55 11.29 31.49 76.69 ▇▁▂▁▁
Verwendung_Recycling 0 1.00 41.29 12.84 2.35 32.68 43.23 51.57 76.69 ▁▅▇▇▁
Verwendung_Unbekannt 0 1.00 20.47 17.82 0.00 5.21 17.77 30.98 97.38 ▇▅▂▁▁
Steuern_gewerblich 386 0.91 16564.97 14131.50 4179.66 9079.69 12464.90 19413.38 377492.21 ▇▁▁▁▁
Steuern_privat 285 0.93 13210.62 3648.87 2606.01 10161.97 13667.81 15758.65 35769.54 ▂▇▃▁▁
Kosten_Basis 0 1.00 154.24 76.07 25.69 108.04 136.62 179.16 977.42 ▇▁▁▁▁
Kosten_Sortierung 67 0.98 52.68 33.06 3.39 31.25 48.88 66.44 582.16 ▇▁▁▁▁
Kosten_sonstiges 52 0.99 54.18 43.19 4.27 27.34 41.69 66.49 670.32 ▇▁▁▁▁
wm_df %>%
  complete.cases() %>%
  sum()
> [1] 2018

There are quite a lot of missing values. Omitting all rows containing them would mean a loss of more than 50 % of the data. Create some plots to enhance the understanding of the data set:

wm_df %>% na.omit %>% 
  ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
  ggtitle("Waste by Region") +
  geom_boxplot(aes(fill = Region), outlier.shape = 2,
               outlier.colour = "black",
               outlier.alpha = .5) +
  theme(aspect.ratio = 0.5) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5))

There are a lot of outliers in the total amount of waste per municipality. I am going to take care of them later. Until then they will be hidden, so the other values appear less compressed.

wm_df %>% na.omit %>% 
  ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
  ggtitle("Waste by Region") +
  geom_boxplot(aes(fill = Region), outlier.shape = NA) +
  theme(aspect.ratio = 0.3) +
  coord_cartesian(ylim = c(0, 29)) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5))

wm_df %>% na.omit %>% 
  mutate(Geologischer_Indikator = ifelse(Geologischer_Indikator == 1, "South", ifelse(Geologischer_Indikator == 2, "Middle", "North"))) %>% 
  ggplot(aes(x=Geologischer_Indikator, y=Abfaelle_gesamt)) +
  ggtitle("Waste by geological location") +
  geom_boxplot(aes(fill = Geologischer_Indikator), outlier.shape = NA) +
  coord_cartesian(ylim = quantile(wm_df$Abfaelle_gesamt, c(0, 0.97))) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5))

wm_df %>%
  na.omit %>%
  mutate(`Urbanisation deg.` =
           ifelse(Urbanisierungsgrad == 1, "low",
                  ifelse(Urbanisierungsgrad == 2, "mid",
                         "high")),
         Population = Bevoelkerung,
         `Total waste (kt)` = Abfaelle_gesamt) %>% 
  ggplot(aes(text=paste0("Municipality: ", Gemeinde), y=`Total waste (kt)`)) +
  ggtitle("Waste by population & urbanisation degree") +
  geom_point(aes(x=Population, colour = `Urbanisation deg.`)) +
  theme(plot.title = element_text(hjust = 0.5)) -> p_waste_pop_urb
ggplotly(p_waste_pop_urb)
wm_df %>% na.omit %>% 
  mutate(Urbanisierungsgrad = ifelse(Urbanisierungsgrad == 1, "low urbanisation deg.",
                                     ifelse(Urbanisierungsgrad == 3, "high urbanisation deg.",
                                            "mid urbanisation deg."))) %>% 
  ggplot(aes(x=Urbanisierungsgrad, y=Abfaelle_gesamt)) +
  ggtitle("Mean waste by urbanisation degree") +
  stat_summary(aes(group = Urbanisierungsgrad, fill = Urbanisierungsgrad), fun = mean, geom = "bar") +
  labs(y = "mean(Abfaelle_gesamt)") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5))

Inspecting missing values within the waste sorting related columns:

sapply(wm_df[, 16:25],
       function(y) sum(is.na(y))) %>%
  tibble(Column = names(.),
         NA_Count = .)
> # A tibble: 10 × 2
>    Column          NA_Count
>    <chr>              <int>
>  1 Sortierungsgrad        0
>  2 Sort_Bio             511
>  3 Sort_Papier           25
>  4 Sort_Glas             33
>  5 Sort_Holz           1095
>  6 Sort_Metall          246
>  7 Sort_Plastik          39
>  8 Sort_Elektrik        314
>  9 Sort_Textil         1013
> 10 Sort_Rest            136
options(width = 200)
# collect the row index of every NA in the Sort_* columns (17:25)
Sort_NAs <- c()
for (i in 17:25) {
  nas_t <- which(is.na(wm_df[i]))
  Sort_NAs <- c(Sort_NAs, nas_t)
}

# the five rows with the most NAs among these columns
Sort_NAs_table <- Sort_NAs %>% table %>% sort(decreasing = T)
Sort_NAs_table[1:5] %>% names %>% as.numeric -> t

wm_df[t,16:25]
>      Sortierungsgrad Sort_Bio Sort_Papier Sort_Glas Sort_Holz Sort_Metall Sort_Plastik Sort_Elektrik Sort_Textil Sort_Rest
> 429             0.72       NA        0.72        NA        NA          NA           NA            NA          NA        NA
> 3572            0.70       NA          NA        NA        NA          NA           NA           0.7          NA        NA
> 4175            0.00       NA          NA        NA        NA          NA           NA           0.0          NA        NA
> 3777           59.32       NA          NA     39.84        NA          NA           NA            NA          NA     19.48
> 4017            1.87       NA          NA        NA        NA          NA         0.62            NA          NA      1.25

It looks like the sum of the values present in these waste sorting columns equals the value in Sortierungsgrad. Imputing arbitrary values here would break the logic behind these columns. Dropping all affected rows would be a viable option; however, replacing these NAs with zeros preserves the column logic as well. The latter option is the one I chose.
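A minimal sketch with made-up numbers, mirroring the Sort_* column logic (the actual imputation happens later inside the recipe), shows that zero-replacement keeps the columns internally consistent:

```r
library(dplyr)

# toy data mirroring the Sort_* logic: Sortierungsgrad is the
# row-wise sum of the observed sorted fractions
sort_toy <- tibble(
  Sortierungsgrad = c(60.0, 12.5),
  Sort_Papier     = c(25.0, 12.5),
  Sort_Glas       = c(35.0, NA),
  Sort_Bio        = c(NA,   NA)
)

# replacing the NAs with 0 preserves the identity
# Sortierungsgrad == Sort_Papier + Sort_Glas + Sort_Bio
sort_toy_imputed <- sort_toy %>%
  mutate(across(starts_with("Sort_"), ~ coalesce(.x, 0)))

all(rowSums(sort_toy_imputed[-1]) == sort_toy_imputed$Sortierungsgrad)
```

Any other imputed value would inflate the row sums beyond Sortierungsgrad, which is why zero is the only constant that respects this identity.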

Dimension Reduction

Within the data set there are dimensions that hold no value for any analysis: ID is a unique identifier, Gemeinde is the name of a municipality, Strassen contains more than 10 % missing values, and Region and Provinz contain too many unique values, which would complicate the process considerably. The importance of the information they hold is questionable anyway.

cols_to_exclude <- c("ID",
                     "Gemeinde",
                     "Strassen",
                     "Region",
                     "Provinz")

A vector containing the names of the columns mentioned is created. A recipe from the recipes package (tidymodels) is used to remove these dimensions, replace the missing values in the Sort_* columns with zero, impute the remaining missing values via bagged trees, cap outliers in the numeric dimensions at the IQR limits of their respective columns and convert the remaining nominal columns into dummy variables.

Note: step_impute_constant() and step_outliers_iqr_to_limits() are functions from the steffanossaR package found at https://github.com/steffanossa/steffanossaR.

recipe_prep <- recipe(~., data = wm_df) %>% 
  step_rm(all_of(cols_to_exclude)) %>% 
  step_impute_constant(contains("Sort_"), constant = 0) %>% 
  step_impute_bag(all_numeric()) %>% 
  step_outliers_iqr_to_limits(all_numeric(), -ends_with("gemeinde")) %>% 
  step_other(all_nominal(), threshold = .001, other = NA) %>% 
  step_naomit(everything(), skip = T) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(everything())

Principal Components Analysis (PCA)

A PCA is used to reduce the number of variables by finding principal components of the data, which are new and uncorrelated variables that can explain the most variance in the original data.
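As a quick illustration of what "new and uncorrelated" means, here is a toy example (synthetic data, not the waste data set) where two strongly correlated inputs are turned into orthogonal components:

```r
set.seed(1)

# two strongly correlated input variables
toy <- data.frame(x = rnorm(200))
toy$y <- toy$x + rnorm(200, sd = 0.3)
cor(toy$x, toy$y)  # close to 1

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# the component scores are uncorrelated by construction ...
round(cor(toy_pca$x), 10)

# ... and PC1 alone captures almost all of the variance
summary(toy_pca)
```

Dropping PC2 here would lose almost no information, which is exactly the mechanism the dimension reduction below exploits on the real data.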

First, the Bartlett test is performed. It tests the null hypothesis that the correlation matrix equals the identity matrix, i.e. that the variables of the data set are uncorrelated. If the resulting p-value is below .05, the null hypothesis is rejected and it is concluded that the variables are correlated2.

psych::cortest.bartlett(cor(wm_df_prepped), n = 100)
> $chisq
> [1] 2562.32
> 
> $p.value
> [1] 4.570546e-286
> 
> $df
> [1] 465

The p-value is far below 0.05; there is correlation between the dimensions of the data set.

Next, the Kaiser-Meyer-Olkin criterion (KMO) is looked at. The KMO measures the adequacy of a data set for factor analysis. It ranges from 0 to 1, where a higher value indicates higher suitability. A value above .6 is generally considered to be the threshold. However, some sources also consider .5 to be acceptable3.

psych::KMO(wm_df_prepped)$MSA
> [1] 0.5683696

0.5683696 is not very good, but I will consider it acceptable. Now the PCA can be executed.

options(width = 90)
wm_df_pca <- 
  wm_df_prepped %>% 
  prcomp(scale. = T,
         center = T)
wm_df_pca %>%
  summary()
> Importance of components:
>                           PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8
> Standard deviation     2.5701 2.3620 1.59236 1.34903 1.25906 1.15444 1.03782 1.02187
> Proportion of Variance 0.2131 0.1800 0.08179 0.05871 0.05114 0.04299 0.03474 0.03368
> Cumulative Proportion  0.2131 0.3931 0.47484 0.53355 0.58468 0.62767 0.66242 0.69610
>                            PC9   PC10    PC11    PC12    PC13    PC14    PC15    PC16
> Standard deviation     0.96945 0.9563 0.91464 0.88193 0.85826 0.80272 0.79652 0.76392
> Proportion of Variance 0.03032 0.0295 0.02699 0.02509 0.02376 0.02079 0.02047 0.01883
> Cumulative Proportion  0.72642 0.7559 0.78290 0.80799 0.83176 0.85254 0.87301 0.89183
>                           PC17   PC18    PC19    PC20    PC21    PC22    PC23   PC24
> Standard deviation     0.74174 0.6909 0.64882 0.60965 0.55892 0.54515 0.47316 0.4524
> Proportion of Variance 0.01775 0.0154 0.01358 0.01199 0.01008 0.00959 0.00722 0.0066
> Cumulative Proportion  0.90958 0.9250 0.93856 0.95055 0.96062 0.97021 0.97743 0.9840
>                           PC25    PC26    PC27    PC28    PC29    PC30    PC31
> Standard deviation     0.43156 0.38944 0.32427 0.17288 0.09757 0.08770 0.06888
> Proportion of Variance 0.00601 0.00489 0.00339 0.00096 0.00031 0.00025 0.00015
> Cumulative Proportion  0.99004 0.99494 0.99833 0.99929 0.99960 0.99985 1.00000

Taking a look at the first 15 principal components (PCs) and the percentage of variance each of them explains.

wm_df_pca %>%
  factoextra::fviz_eig(ncp = 15,
                       addlabels = T)

The elbow method would suggest using three PCs, but their cumulative variance of 47.48 % is too little to result in useful outcomes.

Another method of evaluating the number of PCs to keep is the Kaiser criterion, which states that factors with an eigenvalue above 1 are considered important and should be retained. An eigenvalue above 1 means the factor explains more variance than a single standardised variable would.

factoextra::get_eig(wm_df_pca) %>%
  filter(eigenvalue > 1)
>       eigenvalue variance.percent cumulative.variance.percent
> Dim.1   6.605264        21.307304                    21.30730
> Dim.2   5.579191        17.997390                    39.30469
> Dim.3   2.535613         8.179398                    47.48409
> Dim.4   1.819884         5.870593                    53.35469
> Dim.5   1.585226         5.113631                    58.46832
> Dim.6   1.332728         4.299123                    62.76744
> Dim.7   1.077079         3.474448                    66.24189
> Dim.8   1.044212         3.368424                    69.61031

Eight components possess eigenvalues above 1, with a cumulative variance of 69.61 %.

get_eig(wm_df_pca)[1:15,] %>% 
  ggplot(aes(y=eigenvalue, x=1:15)) +
  ggtitle("Eigenvalue by Component") +
  labs(x = "Component",
       y = "Eigenvalue") +
  geom_line() +
  geom_point(col = "blue") +
  scale_x_continuous(breaks = 1:15) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme(plot.title = element_text(hjust = 0.5))

A third approach is Horn’s method (parallel analysis). Here, random data sets of the same size (columns and rows) as the original data set are generated and their eigenvalues computed. The idea is that a component of the original data is only meaningful if its eigenvalue exceeds what random, structureless data of the same dimensions would produce; components whose bias-adjusted eigenvalue falls below the corresponding random eigenvalue are discarded4.

set.seed(187)
wm_df_prepped %>%
  paran::paran()
> 
> Using eigendecomposition of correlation matrix.
> Computing: 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
> 
> 
> Results of Horn's Parallel Analysis for component retention
> 930 iterations, using the mean estimate
> 
> -------------------------------------------------- 
> Component   Adjusted    Unadjusted    Estimated 
>             Eigenvalue  Eigenvalue    Bias 
> -------------------------------------------------- 
> 1           6.447778    6.605264      0.157485
> 2           5.441013    5.579191      0.138178
> 3           2.412554    2.535613      0.123058
> 4           1.709277    1.819883      0.110606
> 5           1.486338    1.585225      0.098887
> 6           1.244675    1.332728      0.088052
> -------------------------------------------------- 
> 
> Adjusted eigenvalues > 1 indicate dimensions to retain.
> (6 components retained)

Horn’s method suggests keeping 6 PCs. I chose to keep 8, with approximately 69.61 % cumulative variance. Next, we take a look at the contributions of the original variables to each new PC.

n_PCs <- 8
for (i in 1:n_PCs) {
  factoextra::fviz_contrib(wm_df_pca, "var", axes = i) %>%
    print
}

The psych package comes with a function that can illustrate the contribution of each original variable to the PCs in one plot.

wm_df_prepped %>%
  psych::principal(nfactors = n_PCs) %>%
  psych::fa.diagram()

A new data set is created based on the new dimensions.

wm_df_transformed_pca <-
  as.data.frame(-wm_df_pca$x[,1:n_PCs])

Factor Analysis (FA)

Like PCA, a factor analysis can be used to reduce the dimensions of a data set. However, while PCA creates uncorrelated variables, FA identifies underlying latent factors that explain the relationships among the original variables. The factors an FA puts out may be correlated, so a rotation can be applied to simplify their structure; orthogonal rotations (e.g. varimax) additionally keep the factors uncorrelated.

First, a vector containing all rotation methods is created. Then we iterate over each of them using a for-loop.

options(width = 300)
rot_meth <- c("varimax",
              "quartimax",
              "equamax",
              "oblimin",
              "promax")
max_cumvar <- 0
for (rm in rot_meth) {
  cat("Factor Analysis results. Rotation method: ", rm, "\n")
  res <- factanal(wm_df_prepped,
                  factors = n_PCs,
                  rotation = rm,
                  lower = 0.1) %>% 
    steffanossaR::ex_factanal()
  if (res[3,n_PCs] > max_cumvar & res[3,n_PCs] <= 1) {max_cumvar <- res[3,n_PCs]}
  print(res)
  cat("\n")
}
> Factor Analysis results. Rotation method:  varimax 
>                  Factor1   Factor2    Factor3    Factor4    Factor5   Factor6    Factor7    Factor8
> SS loadings    5.2605001 4.2311596 2.43491023 1.72982899 1.47423168 1.3672520 1.17974164 1.16537040
> Proportion Var 0.1696936 0.1364890 0.07854549 0.05580094 0.04755586 0.0441049 0.03805618 0.03759259
> Cumulative Var 0.1696936 0.3061826 0.38472806 0.44052900 0.48808486 0.5321898 0.57024594 0.60783854
> 
> Factor Analysis results. Rotation method:  quartimax 
>                  Factor1   Factor2   Factor3    Factor4    Factor5    Factor6    Factor7   Factor8
> SS loadings    5.2339896 4.2670551 3.2393682 1.63982088 1.33641019 1.31312776 1.07467596 0.7385470
> Proportion Var 0.1688384 0.1376469 0.1044957 0.05289745 0.04311001 0.04235896 0.03466697 0.0238241
> Cumulative Var 0.1688384 0.3064853 0.4109811 0.46387851 0.50698851 0.54934747 0.58401444 0.6078385
> 
> Factor Analysis results. Rotation method:  equamax 
>                  Factor1    Factor2   Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
> SS loadings    4.3650982 2.97044887 2.5651322 1.98388108 1.90522786 1.90121230 1.72566190 1.42633231
> Proportion Var 0.1408096 0.09582093 0.0827462 0.06399616 0.06145896 0.06132943 0.05566651 0.04601072
> Cumulative Var 0.1408096 0.23663055 0.3193767 0.38337291 0.44483188 0.50616130 0.56182782 0.60783854
> 
> Factor Analysis results. Rotation method:  oblimin 
>                 Factor1   Factor2   Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
> SS loadings    4.438703 3.1594732 1.9144761 1.73885944 1.48121743 1.47968744 1.35645032 1.33117830
> Proportion Var 0.143184 0.1019185 0.0617573 0.05609224 0.04778121 0.04773185 0.04375646 0.04294124
> Cumulative Var 0.143184 0.2451025 0.3068598 0.36295201 0.41073321 0.45846507 0.50222153 0.54516276
> 
> Factor Analysis results. Rotation method:  promax 
>                  Factor1   Factor2   Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
> SS loadings    5.0079686 3.6562588 3.6140890 1.72102808 1.47748204 1.45830140 1.44928675 0.91531511
> Proportion Var 0.1615474 0.1179438 0.1165835 0.05551703 0.04766071 0.04704198 0.04675119 0.02952629
> Cumulative Var 0.1615474 0.2794912 0.3960747 0.45159176 0.49925247 0.54629445 0.59304564 0.62257193

62.26 % is the maximum cumulative variance achieved with 8 factors, roughly 7 percentage points less than what the PCA yielded. Therefore, the PCA result will be used.

Cluster Analysis

To assess the clustering tendency of a data set, the Hopkins statistic can be used. It measures the probability that a given data set was generated by a uniform data distribution; the higher the resulting value, the better the clustering tendency. Values range from 0 to 15.

#wm_df_transformed_pca %>%
#  get_clust_tendency(n = nrow(wm_df_transformed_pca) - 1, graph = F)

~ 0.83 is quite good.
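For reference, the Hopkins statistic can also be sketched by hand (same high-means-clusterable convention as factoextra::get_clust_tendency; the function below is a simplified illustration on synthetic data, not the project code):

```r
set.seed(42)

# synthetic data with two clearly separated clusters
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

hopkins_stat <- function(X, m = 10) {
  n <- nrow(X)
  # distance from point p to its nearest neighbour among the rows of X
  nn_dist <- function(p, X) min(sqrt(rowSums(sweep(X, 2, p)^2)))
  # m real points: distance to their nearest *other* real point
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
  # m uniform points in the bounding box: distance to the nearest real point
  u <- sapply(seq_len(m), function(i) {
    p <- apply(X, 2, function(col) runif(1, min(col), max(col)))
    nn_dist(p, X)
  })
  # close to 1 for clustered data, around 0.5 for uniform data
  sum(u) / (sum(u) + sum(w))
}

hopkins_stat(X)
```

Real points in clustered data sit close to their neighbours (small w) while uniformly placed probes mostly land in empty space (large u), pushing the ratio towards 1.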

Hierarchical Clustering: Agglomerative Methods

Clustering can be done with different approaches. Agglomerative methods start with every observation being considered a single-element cluster. The two most similar clusters are then combined into a new cluster, and this is repeated until there is one cluster containing everything. First a distance matrix and then the clusters are created; for both steps there are multiple methods available.

dist_meth <- c("euclidean",
               "maximum",
               "manhattan",
               "canberra",
               "minkowski")

daisy_meth <- c("euclidean",
                "manhattan",
                "gower")

hclust_meth <- c("single",
                 "complete",
                 "average",
                 "mcquitty",
                 "median",
                 "centroid",
                 "ward")

pdf("figs/dendro_dist_hclust.pdf", width = 16, height = 9)
for (dm in dist_meth) {
  for (cm in hclust_meth) {
    hdist <- dist(scale(wm_df_transformed_pca),
                  dm)
    hcl <- flashClust::flashClust(hdist, cm)
    plot(hcl,
         main = paste0("dist method: ", dm,
                       "\nclust method: ", cm),
         sub = NA,
         labels = FALSE)
  }
}
dev.off()
> png 
>   2
pdf("figs/dendro_daisy.pdf", width = 16, height = 9)
for (dm in daisy_meth) {
  for (cm in hclust_meth) {
    daisydist <- daisy(wm_df_transformed_pca,
                       metric = dm,
                       stand = TRUE)
    flashClust::flashClust(daisydist, cm) %>% 
      plot(main = paste0("dist method: ", dm,
                         "\nclust method: ", cm),
           sub = NA,
           labels = FALSE)
  }
}
dev.off()
> png 
>   2

All possible combinations can be found in figs/dendro_dist_hclust.pdf and figs/dendro_daisy.pdf. The dist + hclust combination of Canberra and Ward looks most promising.

hclust_a <- 
  dist(scale(wm_df_transformed_pca), 
       method = "canberra") %>% 
  flashClust::flashClust(method = "ward")
 
hclust_a %>% plot(main = paste0("dist method: canberra",
                                "\nclust method: ward"),
                  sub = NA,
                  labels = FALSE)

The number of viable clusters seems to range from 2 to 7. Microsoft Excel can be used to comfortably compare clusters for profiling and to help determine the number of clusters to choose.

for (i in 2:7) {
  wm_df_prepped_clust_a <-
    wm_df_prepped %>%
    mutate(Cluster_a = cutree(hclust_a, k = i))
  
  profiling_df <-
    wm_df_prepped_clust_a %>% 
    group_by(Cluster_a) %>% 
    mutate(n_island_municipalities = sum(Inselgemeinde),
           Inselgemeinde = NULL,
           n_coastal_municipalities = sum(Kuestengemeinde),
           Kuestengemeinde = NULL) %>% 
    summarize_all(mean)
  profiling_df$n_Cluster <- i
  profiling_df$Size <- table(wm_df_prepped_clust_a$Cluster_a)
  
  if (i == 2) {
    profiling_df_final <- profiling_df
    # write to csv
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_hclust_a.csv",
                sep = ";",
                dec = ",",
                row.names = F)
  } else {
    profiling_df_final <-
      profiling_df_final %>% 
      rows_append(profiling_df)
    
    # append rows with leading blank line to csv
    write.table("",
                file = "profiling_hclust_a.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_hclust_a.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
  }
}

Using VBA, the differences of the means of the different clusters can be emphasised. I use this macro for highlighting.

Sub Highlighting()
    
    Dim data_end_col As Integer
    Dim cluster_group_col As Long
    Dim cluster_count As Integer
    Dim i As Integer
    Dim j As Integer
    Dim group_end_row As Integer
    
    ' Dim chr As String
    ' get col of cluster groups
    ' chr = InputBox("ColName of cluster groupings", "Text Input", "n_Cluster")
    ' cluster_group_col = Application.WorksheetFunction.Match(chr, Range(Cells(1, 1), Cells(1, data_end_col)), 0)
    
    i = 2
    j = 2
    data_end_col = Cells(1, Columns.Count).End(xlToLeft).Column
    cluster_group_col = Application.WorksheetFunction.Match("n_Cluster", Range(Cells(1, 1), Cells(1, data_end_col)), 0)
    
    ' loop through the number of clusters
    Do While i <= ActiveSheet.UsedRange.Rows.Count
        cluster_count = Cells(i, cluster_group_col)
        group_end_row = i + cluster_count - 1
        
        ' loop through dimensions
        For j = 2 To data_end_col
            Range(Cells(i, j), Cells(group_end_row, j)).Select
            
            ' add highlighting
            Selection.FormatConditions.AddColorScale ColorScaleType:=3
            Selection.FormatConditions(Selection.FormatConditions.Count).SetFirstPriority
            Selection.FormatConditions(1).ColorScaleCriteria(1).Type = _
                xlConditionValueLowestValue
            With Selection.FormatConditions(1).ColorScaleCriteria(1).FormatColor
                .Color = 13011546
                .TintAndShade = 0
            End With
            Selection.FormatConditions(1).ColorScaleCriteria(2).Type = _
                xlConditionValuePercentile
            Selection.FormatConditions(1).ColorScaleCriteria(2).Value = 50
            With Selection.FormatConditions(1).ColorScaleCriteria(2).FormatColor
                .Color = 16776444
                .TintAndShade = 0
            End With
            Selection.FormatConditions(1).ColorScaleCriteria(3).Type = _
                xlConditionValueHighestValue
            With Selection.FormatConditions(1).ColorScaleCriteria(3).FormatColor
                .Color = 7039480
                .TintAndShade = 0
            End With
            
        Next j
        i = group_end_row + 2
        
    Loop
End Sub

Hierarchical Clustering: Divisive Methods

Unlike the agglomerative approach, the divisive method begins with a single cluster containing every observation; with each step, existing clusters are divided into smaller ones until there are as many clusters as observations.

diana_res <- diana(wm_df_transformed_pca, metric = "euclidean", stand = TRUE)
pltree(diana_res)

to be continued


  1. Municipal waste is all waste collected and treated by or for municipalities. It includes waste from households, including bulky waste, similar waste from trade and commerce, office buildings, institutions and small businesses, as well as yard and garden waste, street sweepings and the contents of waste containers. The definition excludes waste from municipal sewage systems and their treatment, as well as waste from construction and demolition work.

  2. Reference: https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm

  3. Reference: https://www.empirical-methods.hslu.ch/entscheidbaum/interdependenzanalyse/reduktion-der-variablen/faktoranalyse/

  4. Reference: doi:10.1007/bf02289447

  5. Reference: https://www.datanovia.com/en/lessons/assessing-clustering-tendency/